Card specifications:

Clock frequency: 1.5 GHz (Estimated)
Peak Performance: 1.5 TFlops
Number of transistors: 3.0 billion
Total Number of FP32 CUDA Cores: 512
Total Number of FP64 CUDA Cores: 256
Global memory clock: 4 GHz
DRAM Bandwidth: 192 GB/s
Max DRAM: 6 GB
DRAM Type: GDDR5
L2 Unified Cache: 768KB
Number of SMs: 16
Number of TPCs: NA

Streaming Multiprocessor (SM) specifications:

Number of CUDA Cores per SM: 32
Number of FP32 CUDA Cores per SM: 32
Number of FP64 CUDA Cores per SM: 16
Number of Tensor Cores per SM: NA
Number of TU: 4
Number of SFUs per SM: 4
Number of LD/ST per SM: 16
Number of Warp Schedulers: 2
L1 Cache / Shared Memory: 64KB
Shared Memory: 32KB of 32 bits
Registers: 32KB of 32 bits
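
The peak performance listed on a card can be sanity-checked from its core count and clock frequency. The sketch below is host-side C++ only and assumes that each FP32 core completes one fused multiply-add per clock, counted as 2 floating-point operations. The function name peak_fp32_tflops is hypothetical, and real cards may quote boost clocks that differ from the rounded values on these slides.

```cpp
#include <cassert>

// Theoretical peak FP32 throughput in TFlops.
// Assumption: every FP32 core retires one FMA (2 flops) per clock.
double peak_fp32_tflops(int fp32_cores, double clock_ghz) {
    assert(fp32_cores > 0 && clock_ghz > 0.0);
    const double flops_per_core_per_clock = 2.0;
    // cores * GHz * 2 gives GFlops; divide by 1000 to get TFlops.
    return fp32_cores * clock_ghz * flops_per_core_per_clock / 1000.0;
}
```

With the values of the card above (512 FP32 cores at 1.5 GHz) this gives about 1.54 TFlops, in line with the 1.5 TFlops listed.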


KEPLER 


Card specifications:

Clock frequency: 1.1 GHz
Peak Performance: 3.1 TFlops
Number of transistors: 3.5 billion
Total Number of FP32 CUDA Cores: 1536
Total Number of FP64 CUDA Cores: X
Global memory clock: 6 GHz
DRAM Bandwidth: 192 GB/s
Max DRAM: 4 GB
DRAM Type: GDDR5
L2 Unified Cache: 768KB
Number of SMs: 8 SMX
Number of TPCs: NA

Streaming Multiprocessor (SM) specifications:

Number of CUDA Cores per SM: 192
Number of FP32 CUDA Cores per SM: 192
Number of FP64 CUDA Cores per SM: 96
Number of Tensor Cores per SM: NA
Number of TU: 16
Number of SFUs per SM: 32
Number of LD/ST per SM: 32
Number of Warp Schedulers: 4
L1 Cache / Shared Memory: up to 128KB
Shared Memory: up to 128KB of 32 bits
Registers: up to 128KB of 32 bits
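
The DRAM bandwidth on these cards is linked to the global memory clock through the width of the memory bus. The sketch below is host-side C++ only. The bus width does not appear on the slides, so the 256-bit and 384-bit values used in the usage note are assumptions, and the function name dram_bandwidth_gbs is hypothetical.

```cpp
#include <cassert>

// Peak DRAM bandwidth in GB/s.
// effective_clock_ghz: effective data rate per pin, in giga-transfers/s.
// bus_width_bits: width of the memory bus (assumed, not on the slides).
double dram_bandwidth_gbs(double effective_clock_ghz, int bus_width_bits) {
    assert(effective_clock_ghz > 0.0 && bus_width_bits > 0);
    // bits per second across the bus, divided by 8 to get bytes.
    return effective_clock_ghz * bus_width_bits / 8.0;
}
```

A 6 GHz memory clock with an assumed 256-bit bus gives 192 GB/s, and a 4 GHz memory clock with an assumed 384-bit bus also gives 192 GB/s, which matches the bandwidth listed on the two cards above.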
MAXWELL Year 2014

Card specifications:

Clock frequency: 1.1 GHz
Peak Performance: 4.6 TFlops
Number of transistors: 8.1 billion
Total Number of FP32 CUDA Cores: 3072
Total Number of FP64 CUDA Cores: 96
Global memory clock: 1.7 GHz
DRAM Bandwidth: 336 GB/s
Max DRAM: 12 GB
DRAM Type: GDDR5
L2 Unified Cache: 2MB
Number of SMs: 24 SMM
Number of TPCs: NA

Streaming Multiprocessor (SM) specifications:

Number of CUDA Cores per SM: 128
Number of FP32 CUDA Cores per SM: 128
Number of FP64 CUDA Cores per SM: 4
Number of Tensor Cores per SM: NA
Number of TU: 8
Number of SFUs per SM: 32
Number of LD/ST per SM: 32
Number of Warp Schedulers: 4
L1 Cache / Shared Memory: up to 128KB
Shared Memory: up to 128KB of 32 bits
Registers: up to 256KB of 64 bits



PASCAL Year 2016

Card specifications:

Clock frequency: 1.4 GHz
Peak Performance: 12 TFlops
Number of transistors: 15.6 billion
Total Number of FP32 CUDA Cores: 3840
Total Number of FP64 CUDA Cores: 1920
Global memory clock: 1.4 GHz
DRAM Bandwidth: 750 GB/s
Max DRAM: 16 GB
DRAM Type: GDDR5X
L2 Unified Cache: 4MB
Number of SMs: 60
Number of TPCs: 30

Streaming Multiprocessor (SM) specifications:

Number of CUDA Cores per SM: 64
Number of FP32 CUDA Cores per SM: 64
Number of FP64 CUDA Cores per SM: 32
Number of Tensor Cores per SM: NA
Number of TU: 4
Number of SFUs per SM: 16
Number of LD/ST per SM: 16
Number of Warp Schedulers: 2
L1 Cache / Shared Memory: up to 64KB of 32 bits
Shared Memory: up to 64KB of 32 bits
Registers: 64KB of 32 bits

VOLTA Year 2017

Card specifications:

Clock frequency: 1.5 GHz
Peak Performance: 15.7 TFlops
Number of transistors: 21.1 billion
Total Number of FP32 CUDA Cores: 5120
Total Number of FP64 CUDA Cores: 2560
Global memory clock: 2 GHz
DRAM Bandwidth: 900 GB/s
Max DRAM: 6 GB
DRAM Type: HBM2
L2 Unified Cache: 6MB
Number of SMs: 84
Number of TPCs: 42

Streaming Multiprocessor (SM) specifications:

Number of CUDA Cores per SM: 32
Number of INT CUDA Cores per SM: 64
Number of FP32 CUDA Cores per SM: 64
Number of FP64 CUDA Cores per SM: 32
Number of Tensor Cores per SM: 8
Number of TU: 4 (TEX)
Number of SFUs per SM: 16
Number of LD/ST per SM: 32
Number of Warp Schedulers: 4
L1 Cache / Shared Memory: 128KB
Registers: 16K x 32 bits

TURING

Card specifications:

Clock frequency: 1.6 GHz
Peak Performance: 1.5 TFlops
Number of transistors: 18.6 billion
Total Number of FP32 CUDA Cores: 4608
Total Number of Tensor Cores: 576
Total Number of Ray Tracing Cores: 72
Global memory clock: 2 GHz
DRAM Bandwidth: 672 GB/s
Max DRAM: 11 GB
DRAM Type: GDDR6
L2 Unified Cache: 512KB
Number of SMs: 72
Number of TPCs: 36

Streaming Multiprocessor (SM) specifications:

Number of CUDA Cores per SM: 64
Number of FP32 CUDA Cores per SM: 64
Number of FP64 CUDA Cores per SM: 32
Number of Tensor Cores per SM: 8
Number of TU: 4
Number of SFUs per SM: 4
Number of LD/ST per SM: 16
Number of Warp Schedulers: 2
L1 Cache / Shared Memory: 64KB
Shared Memory: 16KB or 48KB
Registers: 4*16K of 32 bits

AMPERE Year 2020

Card specifications:

Clock frequency: 1.6 GHz
Peak Performance (FP32): 19.5 TFlops
Number of transistors: 54 billion
Total Number of FP32 CUDA Cores: 6912
Total Number of FP64 CUDA Cores: 3456
Total Number of Tensor Cores: 432
Global memory clock: 2.4 GHz
DRAM Bandwidth: 1.6 TB/s
Max DRAM: 40 GB
DRAM Type: HBM2
L2 Unified Cache: 40MB
Number of SMs: 128
Number of TPCs: 64

Streaming Multiprocessor (SM) specifications:

Number of CUDA Cores per SM: 108
Number of INT32 CUDA Cores per SM: 64
Number of FP32 CUDA Cores per SM: 64
Number of FP64 CUDA Cores per SM: 32
Number of Tensor Cores per SM: 4
Number of TU: 4 (TEX)
Number of SFUs per SM: 4
Number of LD/ST per SM: 32
Number of Warp Schedulers: 4
L1 Cache / Shared Memory: 192KB
Registers: 16K x 32 bits
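
KERNEL

struct Vec3 { float x, y, z; };

__global__ void my_kernel(const Vec3* a,
                          Vec3 b,
                          float* c)
{
    // Unique linear thread index for a 2D grid of 3D blocks
    int block  = blockIdx.y * gridDim.x + blockIdx.x;
    int thread = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x
               + threadIdx.x;
    int i = block * (blockDim.x * blockDim.y * blockDim.z) + thread;

    // Dot product of a[i] with b
    c[i] = a[i].x * b.x
         + a[i].y * b.y
         + a[i].z * b.z;
}

dim3 DimGrid(100, 50);    // 100*50*1 = 5000 blocks
dim3 DimBlock(4, 8, 8);   // 4*8*8 = 256 threads per block

my_kernel<<<DimGrid, DimBlock>>>(...);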


Definition:
A thread is a unit of computation (a function) that has its own state, can be paused and
resumed, and is executed on the GPU or on the CPU.

There are 3 types of functions, identified by their qualifier (strictly speaking, only
__global__ functions are kernels):

* __global__: called by the CPU but executed by the GPU
* __device__: called and executed by the GPU
* __host__: called and executed by the CPU

A kernel is called this way: kernel<<<nBlocks, threadsPerBlock>>>(arguments);
* nBlocks: size of the thread grid to use (number of blocks)
* threadsPerBlock: number of threads to execute in each block
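
The first launch parameter is usually derived from the problem size. The sketch below is host-side C++ only and shows the usual rounding-up division. The function name blocks_for is hypothetical, and a 1D launch is assumed.

```cpp
#include <cassert>

// Number of blocks needed so that blocks * threads_per_block >= n.
// The division rounds up, so the last block may be only partly used.
int blocks_for(int n, int threads_per_block) {
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}
```

For n = 1000 elements and 256 threads per block this returns 4, i.e. 1024 threads are launched for 1000 elements. That is why kernels normally check that their index is smaller than n before touching memory.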

THREAD

Definition:
A GPU thread is an instantiation of a function over a given piece of data in a GPU kernel
(__global__ or __device__).
For parallel computing: 1 thread = 1 function application over 1 data element.
Typically, each thread in a kernel computes one element of an array. Most CUDA programs use a
common pattern for this: each thread derives the index of its element from its block ID and
its thread ID.
Once a kernel is launched, its dimensions cannot change.
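
The one-element-per-thread pattern can be sketched on the host. This is plain C++ that emulates a 1D launch with two loops, not actual device code: block_idx, block_dim and thread_idx stand in for CUDA's blockIdx.x, blockDim.x and threadIdx.x, and the names thread_body and emulate_launch are hypothetical. The doubling of each element is just a placeholder workload.

```cpp
#include <cassert>
#include <vector>

// Work done by one thread: compute one output element.
void thread_body(int block_idx, int block_dim, int thread_idx,
                 const std::vector<float>& in, std::vector<float>& out) {
    int i = block_idx * block_dim + thread_idx;  // global element index
    if (i < static_cast<int>(in.size())) {       // guard the partial last block
        out[i] = 2.0f * in[i];
    }
}

// Host-side emulation of a launch: every (block, thread) pair runs once.
// On a GPU these iterations would run in parallel, in no fixed order.
std::vector<float> emulate_launch(const std::vector<float>& in, int block_dim) {
    assert(block_dim > 0);
    std::vector<float> out(in.size(), 0.0f);
    int n = static_cast<int>(in.size());
    int n_blocks = (n + block_dim - 1) / block_dim;
    for (int b = 0; b < n_blocks; ++b) {
        for (int t = 0; t < block_dim; ++t) {
            thread_body(b, block_dim, t, in, out);
        }
    }
    return out;
}
```

With 5 input elements and block_dim = 4, two blocks (8 threads) are emulated and the bounds check makes the last 3 threads do nothing.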


Memory: Local Memory 


Each thread has its own private local memory 
Only exists for the lifetime of the thread 
Generally handled automatically by the compiler 

THREAD BLOCK

[Figure: a thread block is executed by a streaming multiprocessor]

Definition:
A thread block is a programming abstraction representing a group of threads that can be executed
serially or in parallel. For better process and data mapping, threads are grouped into thread blocks.

The number of threads in a block varies with the available shared memory. The threads of a thread
block run on the same streaming multiprocessor. Threads in the same block can communicate with each
other via shared memory, barrier synchronization or other synchronization primitives such as atomic
operations.
Thread ID is unique within a block. Each block can execute in any order relative to other blocks.


Memory: Shared Memory 

Each thread block has its own shared memory accessible only by threads within the block 
Much faster than local or global memory 

Requires special handling to get maximum performance 

Only exists for the lifetime of the block 


Copyright, 2019 OVH 


Kernel Grid 




Definition: 
Multiple thread blocks are combined to form a grid. All the blocks in the same grid contain 
the same number of threads. Grids can be used for computations that require a large 
number of thread blocks to operate in parallel. 


The number of thread blocks in a grid is usually dictated by the size of the data being 
processed or the number of processors in the system, which it can greatly exceed. 


All threads in a grid execute the same kernel function. 
All blocks in a grid have the same dimensions. 
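
The relation between grid, blocks and threads can be checked with a little arithmetic, using the DimGrid(100,50) and DimBlock(4,8,8) values from the kernel example. The sketch below is host-side C++ only: Dim3 is a stand-in for CUDA's dim3, the function names are hypothetical, and the x-fastest ordering of the linear id is one common convention.

```cpp
#include <cassert>

struct Dim3 { int x, y, z; };  // stand-in for CUDA's dim3

// Number of threads in one block.
int threads_per_block(Dim3 block) {
    return block.x * block.y * block.z;
}

// Total threads in the grid = blocks in the grid * threads per block.
long long total_threads(Dim3 grid, Dim3 block) {
    assert(grid.x > 0 && grid.y > 0 && grid.z > 0);
    return static_cast<long long>(grid.x) * grid.y * grid.z
           * threads_per_block(block);
}

// Unique linear id of a thread in the grid: x varies fastest, then y, then z.
long long global_thread_id(Dim3 grid, Dim3 block,
                           Dim3 block_idx, Dim3 thread_idx) {
    long long block_id =
        (static_cast<long long>(block_idx.z) * grid.y + block_idx.y) * grid.x
        + block_idx.x;
    int local_id =
        (thread_idx.z * block.y + thread_idx.y) * block.x + thread_idx.x;
    return block_id * threads_per_block(block) + local_id;
}
```

For a 100 x 50 grid of 4 x 8 x 8 blocks this gives 5000 blocks of 256 threads, i.e. 1,280,000 threads, with linear ids running from 0 to 1,279,999.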


Memory: Global Memory 


* This memory is accessible to all threads as well as the host (CPU).
* Global memory is allocated and deallocated by the host
* Used to initialize the data that the GPU will work on

WARP

[Figure: logical view (thread block), hardware view (multiprocessor, control logic), execution]

Definition:
Warps give the GPU the ability to execute the same application code on hardware with
different numbers of execution resources; this is called transparent scalability. Warps act as
the software-to-hardware translator.

A hardware design can exploit the commonality of the threads belonging to a warp by
combining their memory accesses and assuming that it is fine to pause and resume all the
threads at the same time, rather than deciding on a per-thread basis.

The warp size is the number of threads that execute together, in lockstep, on a multiprocessor
(32 on the architectures described here).

Warps are managed by the warp schedulers, which orchestrate the execution of the thread blocks
on the physical architecture, meaning the multi-core units, aka CUDA/RT/Tensor cores.
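
How a block is cut into warps can be sketched with integer arithmetic. This is host-side C++ only, assuming a warp size of 32 and a linear thread index inside the block. The names warps_per_block, warp_id and lane_id are hypothetical helpers, not CUDA built-ins.

```cpp
#include <cassert>

const int kWarpSize = 32;  // assumed warp size

// Number of warps needed to cover a block; the last warp may be partly idle.
int warps_per_block(int threads_per_block) {
    assert(threads_per_block > 0);
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Warp a thread belongs to, given its linear index inside the block.
int warp_id(int thread_in_block) {
    return thread_in_block / kWarpSize;
}

// Position of the thread inside its warp (its "lane").
int lane_id(int thread_in_block) {
    return thread_in_block % kWarpSize;
}
```

A block of 256 threads is split into 8 warps. A block of 100 threads still needs 4 warps, and 28 lanes of the last warp stay idle, which is one reason block sizes are usually chosen as multiples of 32.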

Streaming Multiprocessor (SM)

[Figure: SM block diagram showing the texture units (Tex) and the 128KB L1 Data Cache / Shared Memory]

Definition:
The Streaming Multiprocessor (SM) is the part where the magic happens. This design was first
introduced in 2010 with Fermi, was derived into the SMX with Kepler (2012) and the SMM with
Maxwell (2014), and was reintroduced in 2016 with Pascal and then Volta (2017).

It is composed of:

* Scheduling tools (Dispatch Units, Warp Schedulers)
* Memory (L0, L1 Cache)
* Register File: links main memory data and the computation components residing in the
  multi-core units
* Multi-core units: perform the calculations, and also contain the components that manage the
  memory flows between memory units and computation cores
Source: https://research.nvidia.com/sites/default/files/pubs/2012-12_Unifying-Primary-Cache/Gebhart_MICRO_2012.pdf
http://www.irisa.fr/alf/downloads/collange/cours/gpuprog_ufmg/gpuprog_1.pdf


Streaming Multiprocessor X (SMX)

[Figure: SMX block diagram, including the 64 KB Shared Memory / L1 Cache and the 48 KB Read-Only Data Cache]

Definition:
The Streaming Multiprocessor X (SMX) is a variation of the SM.

The main difference with the SM is that NVIDIA tried at one point to reduce the number of SMs and to
make each one bigger. Basically, SMX are SMs on steroids in terms of number of cores, but they might
be less efficient because the amount of shared resources per core is reduced. However, packing
everything like this saves space and leaves room for more transistors (therefore cores) on the same
GPU surface. The trade-off is interesting, and it was introduced with the Kepler micro-architecture (2012).

Just like SMs, SMX are composed of:

* Scheduling tools (Dispatch Units, Warp Schedulers)
* Memory (L0, L1 Cache)
* Register File: links main memory data and the computation components residing in the
  multi-core units
* Multi-core units: perform the calculations, and also contain the components that manage the
  memory flows between memory units and computation cores

Streaming Multiprocessor M (SMM)

Definition:
The Streaming Multiprocessor M (SMM) is a variation of the SM and SMX used for the Maxwell
micro-architecture (2014).

If SMX are SMs on steroids, one could describe the SMM as a well-balanced bodybuilding diet along
with a small dose of steroids. The number of cores per SMM is still higher than in the usual SM.
However, the drawback of the ultra-compact SMX design (a poor ratio of shared resources per core)
is better balanced in the SMM, which has 4 subsections, each with its own dedicated resources such
as dispatch units, instruction buffer and warp scheduler.

Just like SMs and SMX, SMM are composed of:

* Scheduling tools (Dispatch Units, Warp Schedulers)
* Memory (L0, L1 Cache)
* Register File: links main memory data and the computation components residing in the
  multi-core units
* Multi-core units: perform the calculations, and also contain the components that manage the
  memory flows between memory units and computation cores

Special Function Unit (SFU)

[Figure: SM sub-partitions, each with an L0 Instruction Cache, a Warp Scheduler (32 thread/clk),
a Dispatch Unit (32 thread/clk) and a Register File (16,384 x 32-bit), sharing a 128KB L1 Data
Cache / Shared Memory]

Definition:
SFUs execute transcendental instructions such as sine, cosine,
reciprocal, and square root. Each SFU executes one
instruction per thread, per clock; a warp executes over
eight clocks. The SFU pipeline is decoupled from the
dispatch unit, allowing the dispatch unit to issue to other
execution units while the SFU is occupied.
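
The "eight clocks" figure follows from the warp size and the number of SFUs. The sketch below is host-side C++ only, assuming a 32-thread warp and one instruction per thread per clock on each SFU, as stated above. The function name sfu_clocks_per_warp is hypothetical.

```cpp
#include <cassert>

// Clocks needed for one warp to pass through the SFUs of an SM,
// assuming each SFU handles one thread's instruction per clock.
int sfu_clocks_per_warp(int warp_size, int sfus_per_sm) {
    assert(warp_size > 0 && sfus_per_sm > 0);
    return (warp_size + sfus_per_sm - 1) / sfus_per_sm;  // round up
}
```

With 4 SFUs per SM, a 32-thread warp takes 32 / 4 = 8 clocks. With the 16 or 32 SFUs per SM listed for other generations, the same model gives 2 clocks or 1 clock.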

Texture Unit (Tex/TMU)

Definition:
A TMU (texture mapping unit) can rotate, resize, and distort a bitmap image (performing texture
sampling) so that it can be placed onto an arbitrary plane of a given 3D model as a texture. This
process is called texture mapping.

In the past, TMUs were physically separate from the SM, but the Fermi micro-architecture
introduced them as a component of the SM, making them part of the GPGPU strategy.

Load/Store Unit (LD/ST)

Definition:
To feed the computation cores, data must at some point be fetched from memory (L1 data cache)
and pushed to the cores. This is done by load and store instructions, which are handled by the
SM's LD/ST units.

LD/ST units operate on registers, whose size varies from one micro-architecture to another.
Memory accesses are managed at each clock, each operation covering an X-byte block split over
X memory addresses.

Reading memory for all the ALUs assigned to an operation (through a warp) can take multiple
cycles, depending on the memory addresses, the cores and the LD/ST width.

INT ALU (Half precision)

Definition:
Floating-point units give the GPU the ability to perform fused multiply-add instructions
(FMA, also called fused multiply-accumulate or FMAC), as well as addition, multiplication
and division. Special/complex operations are handled by the SFU.

INT or HP (Half Precision) ALUs (Arithmetic Logic Units) perform FMA over 16-bit elements.


FP32 ALU (Single Precision)

Definition:
Floating-point units give the GPU the ability to perform fused multiply-add instructions
(FMA, also called fused multiply-accumulate or FMAC), as well as addition, multiplication
and division. Special/complex operations are handled by the SFU.

FP32 or SP (Single Precision) ALUs (Arithmetic Logic Units) perform FMA over 32-bit elements.

In the latest GPU generations, starting from Pascal, the FP32 units are also able to
process Half Precision (HP) FP16.
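
A fused multiply-add computes a * b + c with a single rounding at the end. The sketch below is host-side C++ that uses the standard library's std::fma to illustrate the operation, not the GPU instruction itself. The helper names fma_f32 and dot_fma are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Fused multiply-add on single-precision values: a * b + c, rounded once.
float fma_f32(float a, float b, float c) {
    return std::fma(a, b, c);
}

// Dot product accumulated with one FMA per element, the way a kernel
// computing a[i].x * b.x + a[i].y * b.y + a[i].z * b.z could be evaluated.
float dot_fma(const float* a, const float* b, int n) {
    assert(n >= 0);
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc = std::fma(a[i], b[i], acc);
    }
    return acc;
}
```

For example, fma_f32(2, 3, 1) returns 7, and the dot product of (1, 2, 3) with (4, 5, 6) returns 32 after three chained FMAs.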


FP64 ALU (Double Precision)

Definition:
Floating-point units give the GPU the ability to perform fused
multiply-add instructions (FMA, also called fused multiply-accumulate or FMAC),
as well as addition, multiplication and division. Special/complex
operations are handled by the SFU.

FP64 or DP (Double Precision) ALUs (Arithmetic Logic Units) perform FMA over 64-bit elements.

Tensor Core

Definition:
Tensor cores are fairly new to GPGPUs: they were introduced in 2017 with the Volta micro-architecture.

Graphical rendering is all about 4x4 matrices: objects have x, y, z coordinates and a rotation, so
object representation is referred to as 4x4x4 matrices. To render an object you need to have the
object in its own frame of reference, then move it to the real-world frame of reference and finally
project it into the "camera" frame of reference (clipping). Everything is just about multiplying and
accumulating 4x4x4 matrices. This is also perfect for Deep Learning applications
(https://www.ovh.com/fr/blog/deep-learning-explained-to-my-8-year-old-daughter/)

The big thing with Tensor Cores is the smart way they were implemented: they perform mixed-precision
multiply-accumulate calculations. Does this operation remind you of something, dear AI
programmers? #ConvolutionNeuralNetworks
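
The multiply-accumulate on 4x4 matrices described above can be written as D = A x B + C. The sketch below is host-side C++ only and shows the arithmetic, not the hardware path. Plain float is used for every operand because standard C++ has no built-in FP16 type, so the mixed-precision aspect is not modeled. The name mma_4x4 is hypothetical.

```cpp
#include <cassert>

const int kN = 4;  // tile size used in the description above

// D = A * B + C on 4x4 tiles stored row-major.
void mma_4x4(const float A[kN][kN], const float B[kN][kN],
             const float C[kN][kN], float D[kN][kN]) {
    for (int i = 0; i < kN; ++i) {
        for (int j = 0; j < kN; ++j) {
            float acc = C[i][j];             // start from the accumulator
            for (int k = 0; k < kN; ++k) {
                acc += A[i][k] * B[k][j];    // multiply and accumulate
            }
            D[i][j] = acc;
        }
    }
}
```

With A set to the identity matrix, the result is simply B + C, which makes the function easy to check by hand. The same multiply-accumulate shape is what dense and convolution layers reduce to.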

Ray Tracing Core (RT Core)

Definition:
With Ray Tracing cores, the unified shader architecture is now being questioned. As explained in the
"Fermi Microarchitecture card", GPUs prior to GPGPU (which started with Fermi) were designed with
hardware specifications matching the image rendering pipeline.

By implementing RT Cores we are (partially) going back to the good old days of image rendering
pipelines encoded into hardware, where pixel and vertex shaders are separated.

Ray tracing is a computing technique that emulates light effects in image rendering. RT Core / RTX
combines ray tracing mathematical calculations with intuitive light effect prediction
using Deep Learning Super Sampling (DLSS) executing on... Tensor cores. All of this should lead to
augmented rasterization through denoising and upsampling.